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Abstract 

We propose and develop a novel and effective perfect sampling methodology for simulating 
from posteriors corresponding to mixtures with either known (fixed) or unknown number of 
components. For the latter we consider the Dirichlet process-based mixture model developed 
by these authors, and show that our ideas are applicable to conjugate, and importantly, to non- 
conjugate cases. As to be expected, and as we show, perfect sampling for mixtures with known 
number of components can be achieved with much less effort with a simplified version of our 
general methodology, whether or not conjugate or non-conjugate priors are used. While no 
special assumption is necessary in the conjugate set-up for our theory to work, we require 
the assumption of compact parameter space in the non-conjugate set-up. However, we argue, 
with appropriate analytical, simulation, and real data studies as support, that such compactness 
assumption is not unrealistic and is not an impediment in practice. Not only do we validate our 
ideas theoretically and with simulation studies, but we also consider application of our proposal 
to three real data sets used by several authors in the past in connection with mixture models. 
The results we achieved in each of our experiments with either simulation study or real data 
application, are quite encouraging. However, the computation can be extremely burdensome 
in the case of large number of mixture components and in massive data sets. We discuss the 
role of parallel processing in mitigating the extreme computational burden. 
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1. INTRODUCTION 

Markov chain Monte Carlo (MCMC) algorithms are developed to simulate from desired distribu- 
tions, from which generation of exact samples is difficult. The methodology has found much use 
in the Bayesian statistical paradigm thanks to the natural need to sample from intractable posterior 
distributions. But in whatever clever way the MCMC algorithms are designed, the samples are gen- 
erated only approximately. Due to impossibility of running the chain for an infinite span of time, 
a suitable burn-in period is chosen, usually by a combination of empirical and ad-hoc means. The 
realizations retained after discarding the burn-in period are presumed to closely represent the true 
distribution. The degree of closeness, however, depends upon how suitably the burn-in is chosen, 
and an arbitrary choice may lead to serious bias. Even in simple problems non-negligible biases 
often result if the burn-in period is chosen inadequately (see, for example, Roberts & Rosenthal 
(1998)). Such problems can only be aggravated in the case of realistic, more complex models, such 

as mixture models of the form, given for the data point y, by 

v 

[ y \e p ,u p ] = J2^f(y\^), (i) 

In ([TJ), 6 P denotes the set of parameters (8i, . . . , 6 P )', Tl p = . . . , ix p )' are the mixing probabili- 
ties such that TTj > for j = 1, . . . ,p, and Y7j=i n j = 1- Here the number of mixture components 
p may or may not be known. The latter case corresponds to variable dimensional parameter space 
since the cardinality of the set P then becomes random. 

Mixture models form a very important class of models in statistics, known for their versatil- 
ity. The Bayesian paradigm even allows for random number of mixture components (making the 
dimensionality of the parameter space a random variable), adding to the flexibility of mixture mod- 
els. Sophisticated MCMC algorithms are needed for posterior inference in mixture models, raising 
the question of adequacy of the available practical convergence assessment methods, particularly 
in the case of variable-dimensional mixture models. The importance of the aforementioned class 
of models makes it important to solve the associated convergence assessment problem. In this 
paper, we develop a rigorous solution to this problem using the principle of perfect sampling. 

The perfect sampling methodology, first proposed in the seminal paper by Propp & Wilson 
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(1996), attempts to completely avoid the problems of MCMC convergence assessment. In prin- 
ciple, starting at all possible initial values, so many parallel Markov chains need to be run, each 
starting at time t = — oo. If by time t — 0, all the chains coalesce, the coalescent point at time 
t = is an exact realization from the stationary distribution. Essentially, this principle works in the 
same way as the regular MCMC algorithms, but by replacing its starting time t — with t = — oo 
and the convergence time t = oo with t = 0. To achieve perfect sampling in practice, Propp & 
Wilson (1996) proposed the "coupling from the past" (CFTP) algorithm, which avoids running 
Markov chains from the infinite past. We briefly describe this in the next section. 

2. THE CFTP ALGORITHM 
For the time being, for the sake of clarity, following Propp & Wilson (1996), let us assume that 
the state space X is finite, and let {X t ; t — 0, 1, . . .} denote the underlying Markov chain. Then, 
for t > it is possible to represent the Markov chain generically as a random mapping: X t+1 = 
4>t{X t ) = (f)(X t , Rt+i), for some function •) and an iid sequence {R t ; t = 1, . . .}. Then the 
CFTP algorithm is as follows (see Propp & Wilson (1996), Robert & Casella (2004)): 

1. For t = — 1, —2, . . ., generate </>t(x) for x E X. 

2. For t — — 1, —2, . . ., for x G X, define the compositions 

M x ) = 0o o 0_i o • • • <f)_ t (x) (2) 

3. Determine the time T such that $ T is constant. 

4. Accept & T (x*) as an exact realization from the stationary distribution for any arbitrary x* G 
X. 

It is well-known (see, for example, Casella, Lavine & Robert (2001)) that the above algorithm 
terminates almost surely in finite time under very mild conditions and indeed yields a realization 
distributed exactly according to the stationary distribution of the Markov chain. Propp & Wilson 
(1996) recommend taking t = —2 j , for j = 1,2, . . ., which we shall adopt in this paper. A 
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subtle, but important point is that, even if all the Markov chains coalesce before time t — 0, the 
corresponding simulation at the time of coalescence need not yield a perfect sample. One needs to 
carry the algorithm forward till time t = 0; the sample corresponding to only t = is guaranteed 
to be perfect. For details, see Casella et al. (2001). 

Although in the seminal paper of Propp & Wilson (1996) the CFTP algorithm, as described 
above, was constructed assuming finite state space in the above algorithm, later developments 
managed to circumvent this assumption of finiteness. Indeed, strategies for perfect sampling in 
general state spaces are described in Murdoch & Green (1998) and Green & Murdoch (1999), but 
quite restricted set-ups, which do not hold generally, are needed to implement such strategies. The 
set up of mixture models is much complex, and the known strategies are difficult to apply. 

The first attempt to construct perfect sampling algorithms for mixture models is by Hobert, 
Robert & Titterington (1999). However, they assumed only 2-component and 3-component mix- 
ture models, where only the mixing probabilities are assumed to be unknown. Bounding chains 
(this term seems to first appear in Huber (1998); Huber (2004) also uses this term) with mono- 
tonicity structures are used to enable the CFTP algorithm in these cases. Using the principle of the 
perfect slice sampler (Mira, M0ller & Roberts (2001)), and assuming conjugate priors on the pa- 
rameters, Casella, Mengersen, Robert & Titterington (2002) proposed a perfect sampling method- 
ology for mixtures with known number of components by marginalizing out the parameters. It is 
noted in Casella et al. (2002) that in the conjugate case the marginalized form of the posterior is an- 
alytically available, but the authors point out (see Section 2 of Casella et al. (2002)) that still perfect 
simulation from the analytically available marginalized posterior is important. Unfortunately, apart 
from the somewhat restricted assumptions of conjugate priors and known number of components, 
the methodology is approximate in nature and the authors themselves demonstrated that the ap- 
proximation can be quite poor. Fearnhead (2005) proposed a direct sampling methodology based 
on recursion relations associated with the forward-backward algorithm, for mixtures of discrete 
distributions assuming a conjugate set-up and known number of components, thus bringing in an 
extra and crucial assumption of discrete data. Most recently, Berthelsen, Breyer & Roberts (2010) 
introduced a new perfect sampling methodology in mixtures with known number of components 
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where only the mixture weights are unknown. The method is also shown to work for normal mix- 
tures (with known number of components) with unknown weights as well as with unknown means, 
but with a known, common variance. However, even in this much restricted setup, the practicality 
of the algorithm is challenged. Indeed, the authors honestly remarked in pages 255-256 the fol- 
lowing: "Unfortunately, in practice this extension of our algorithm seems to be feasible only for 
fairly small data sets (n < 5) where the data consist of one cluster." 

However, the drawbacks of the methodologies in no way present the contributions of the afore- 
mentioned authors in poor light, these only show how difficult the problem is. In this paper we at- 
tempt to avoid the restrictions and difficulties by proposing a novel approach. In the non-conjugate 
case (but not in the conjugate case) we are forced to assume compactness of the parameter space, 
but we argue in Section [3J| followed up with a simulated data example in the supplement and three 
real data cases in Section |5j that it is not an unrealistic assumption, particularly in the Bayesian 
paradigm. Noting particularly that no methodology exists in the literature that even attempts per- 
fect simulation from mixtures with unknown number of components, for either compact or non- 
compact parameter space, for either conjugate or non-conjugate set-up, there is no reason to look 
upon our compactness assumption only in the non-conjugate case as a serious drawback. 

We first construct a perfect sampling algorithm for mixture models with fixed (known) number 
of components and then generalize the ideas to mixtures with unknown number of components. 
For the sake of illustration, we concentrate on mixtures of normal densities, but our ideas are 
quite generally applicable. We illustrate our methodology with simulation studies as well as with 
application to three real data sets. Additional technical details and further details on experiments 
are provided in the supplement, whose sections and figures have the prefix "S-" when referred to 
in this paper. 
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3 . PERFECT SAMPLING FOR NORMAL MIXTURES WITH KNOWN NUMBER OF 

COMPONENTS 
3 . 1 Normal mixture model and prior distributions 

Letting /(■ | in (jTJ) denote normal densities with mean /jLj and variance a 2 , we obtain the 
following normal mixture model 

A 



/A ■ f 

.7=1 ^ 



f (v - !> , (3) 



In (|3j), = (/ij, A-,), where Aj = 2 . For the sake of convenience of illustration only we consider 
the following conjugate prior specification on the unknown variables 

Xj ~ Gamma(ri/2, C/2); j = l,...,p (4) 
~ N(£ j ,Tj\J 1 y,j = l,...,p (5) 
lip = (71-1, . . . ,7T P ) ~ Dirichlet(yi,...,y p ) (6) 

(7) 

In Q, Gamma(r]/2, (/2) denotes the Gamma distribution with density proportional to 

Xf 2 ' 1 exp {-(Xj/2}. We further assume that {77, C}, {6, • • • , {n> • • • , r P } and {7i, • • • , 7 P } 

are known. 

With conjugate priors the marginal posteriors of the parameters (IL P , Q p ) and the allocation 
variables Z are available in closed forms, but still sampling from the posterior distributions is im- 
portant. Indeed, Casella et al. (2002) argue that sampling enables inference on arbitrary functionals 
of the unknown variables, which are not analytically available. These authors proposed a perfect 
slice sampler for sampling from the marginal posterior of the allocation variable Z only. Given 
perfect samples from the posterior of Z, drawing exact samples from the posterior distributions 
of (lip, Q p ) is straightforward. But importantly, the posteriors are not available in closed forms in 
non-conjugate situations, and even Gibbs sampling is not straightforward in such cases. Since our 
goal is to provide a general theory that works for both conjugate and non-conjugate priors, we do 
not focus on the marginalized approach, although the conjugate situation is just a special (and sim- 



pler) case of our proposed principle (see Sections 3.3 and 4.5 ). Due to convenience of illustration 



we begin with the conjugate prior case where the full conditional distributions needed for Gibbs 
sampling are available. It will be shown how the same ideas are carried over to the non-conjugate 
cases. 



3.2 Full conditional distributions 

Assuming that a dataset Y — (y 1 , . . . , y n )' is available, let us define the set of allocation variables 
Z = (zi, . . . , z n )' , where z% = j if yi comes from the j-th component of the mixture. Further, 
defining rij = #{i : z t = j}, y,j = J2i: Zi=j Vi/^j, Z-i = (zi, z i+1 , z n )' and 0_ ip = 

(9i, ... , 9j-i, Oj + i, . . . , Op)', the full conditional distributions of the unknown random variables can 
be expressed as the following: 

[zi = j | Q p ,Z-i,U,Y] oc 7r jx /A7exp|-^-(yj-/Xj) 2 | (8) 
[A^Z.n.e-^^Y] ~ Gamma ( \ | C + '"'^^f + Vi) 



i-.Zi=j 



(9) 



^2 



(ft 'If T- ~\~ £ ' 
' J J t r »— — nI (10) 

[II | Z, @,Y] ~ Dirichlet(m + -fi,...,n p + -fp) (11) 



Perfect sampling, making use of the full conditional distributions available for Gibbs sampling, 
has been developed by M0ller (1999). But the development is based on the assumption that the 
random variables are discrete and that the distribution functions are monotonic in the conditioned 
variables. These are not satisfied in the case of mixtures. Full conditional based perfect sampling 
has also been used by Schneider & Corcoran (2004) in the context of Bayesian variable selection 
in a linear regression model, but their methods depend strongly on the underlying structure of their 
linear regression model and prior assumptions and do not apply to mixture models. Our proposed 
method hinges on obtaining stochastic lower and upper bounds for the Z-part of the Gibbs sampler, 
and simulating only from the two bounding chains, and noting their coalescence. It turns out that, 
in our methodology, there is no need to simulate the other unknowns, (il p , Q p ) before coalescence, 
even in the non-conjugate set-up. Details are provided in the next section. 
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3.3 Bounding chains for Z 

For i = 1, . . . , n, let | Y, Z_i, U p , O p ) denote the distribution function corresponding to the 
full conditional of z^ Writing X_j = (Z_i, U p , O p ), let 

F t L {-\Y) = infF 4 (-|F,X_,) (12) 
F t u (-\Y) = supFj(- | Y,X„i) (13) 

be the lower and the upper bounds of | Y, Z^i, U p , O p ). Note that the full conditional of Zi, 
given by ([8]), is independent of Z-i, hence supremum or infimum over Z_j are not necessary. By 



enforcing bounds on (IT P , 6 P ) the infimum and the supremum in ( 12) and ( 13 ) can be made to be 



bounded away from and 1 for points whose distribution functional values a priori were bounded 



away from and 1. Also, (12) and (13) will take the values or 1 for the points which had, 
respectively, distributional functional values or 1 a priori. In other words, there is a single set 
of points receiving positive masses under the probability mass functions associated with both (fT2l) 



and (13), which we subsequently prove to be distribution functions. This set is also exactly the 
same set of points receiving positive masses under the probability mass function corresponding to 
the distribution function a priori. 

Thus, the support of 6 P would be compact, and that of n p would be a compact subset of its 
original support. This is not an unrealistic assumption since in all practical situations, parameters 
are essentially bounded away from the extreme values. In fact, the prior on the parameters is 
expected to contain at least the information regarding the range of the parameters. In almost all 
practical applications, this range is finite, which, in principle, is possible to elicit. We believe that 
non-compact parameter spaces are assumed only due to the associated analytic advantages (for 
instance, generally integrals are easier to evaluate analytically under the full support) and because 
of the difficulty involved in elicitation of proper priors with truncated support. However, we show 



in Section S-l 1.3 that truncation of the support of Yi p need not always be necessary. 

In order to decide upon some adequate compact support, a pilot Gibbs sampling run with un- 
bounded Q p may be implemented first, and then the effective range of the posterior of P can 
be chosen as the compact support of the prior of O p . This range may be further refined by subse- 
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quently running a Gibbs sampler with the chosen compact support, comparing the resultant density 
with that corresponding to the unbounded support, and, if necessary, modifying the chosen range 
so that the densities agree with each other as closely as possible. It is demonstrated with simulated 



examples in Sections |S-1 1.3[ |4.6[ and with three real applications in Sections |5.1[ |5.2[ and |5.3 
that often the posterior with unbounded support is almost the same as that with compact support, 
obtained from pilot Gibbs sampling. Unless otherwise mentioned, throughout we assume compact 
support of O p . We remark here that the compactness assumption is not needed in the case of con- 



jugate prior on O p . In that case, O p will be integrated out analytically, and hence ( 12 ) and ( 13 ) will 
not involve Q p , thus simplifying proceedings. 

Had the minimizer and the maximizer of F^j \ Y, X_j) with respect to X_i been constant with 



respect to j, then, trivially, ( 12) and ( 13 ) would have been distribution functions. But this is not 
the case unless Z{ takes on only two values with positive probability, as in the case of 2-component 
mixture models. However, as shown in Section S-7 , F^(- \ Y) and Ff(- \ Y) satisfy the properties 



of distribution functions for any discrete random variable. So, their inversions will sandwich all 
possible realizations obtained by inverting | Y, X_j), irrespective of any X_j. 

To clarify the sandwiching argument, we first define the inverse of any distribution function F 
by F~(x) = inf{y : F(y) > x}. Further, let Rz y t = {Rz i: u i — 1, . . . ,n} be a common set of 
iid random numbers used to simulate Z at time t for Markov chains starting at all possible initial 
values. If we define z it = F~(R za | Y, X_j), 4 = if \ Y) and zg = F^{R Zijt \ Y), 

then, for all possible X_j, it holds that z\ t < z it < zf t for i = 1, . . . , n and t = 1,2,.... These 
imply that once all Zi, i — 1, . . . , n, drawn by inverting F t L and F- 7 coalesce, then so will every 
realization of Z drawn from Fi(- \ X_j), for i = 1, . . . , n, starting at all possible initial values. 

Analogous to {R z ,u t = 1,2,...}, let {Rn p ,u t = 1, 2, . . .} and {Re p ,u t = 1,2,...} denote 
sets of iid random numbers needed to generate U p and 6 P , respectively, in a hypothetical CFTP 
algorithm, where Markov chains from all possible starting values are simulated, with Z updated 
first. Once Z coalesces, so will (H p , Q p ) since their full conditionals (see ([9]), |To| and (11)) show 



that the corresponding deterministic random mapping function depends only upon Z, {Ru p ,u t = 
1, 2, . . .}, and {Re p ,u t = 1, 2, . . .}. We remark here that the random numbers can always be 



thought of as realizations of U niform(0, 1), since the deterministic random mapping function can 
always be represented in terms of Uniform(0, 1) random numbers. 

The key idea is illustrated algorithmically below. 
Algorithm 3.1 CFTP for mixtures with known number of components 



(i) For i = 1, . . . , n, and for £ = 1, . . . ,p, calculate F t L (£ | Y) and Fy(£ | Y), given by (12) 



and (13) using some efficient optimization method. We recommend simulated annealing; 



see Section 13771 

(ii) For j = 1 . . ., until coalescence of Z, repeat steps (iii) and (iv) below. 

(iii) Define Sj = {-2 j + 1, . . . , -2 J ' -1 } for j > 2, and let Si = {-1,0}. For each m G Sj, 
generate random numbers Rz, m , Ru p ,m an d R® , m from Uniform(0, 1); once generated, 
treat them as fixed thereafter for all iterations. It is important to note that at step — 2 J no 
random number generation is required since that step can be viewed as the initializing step, 
where all possible chains, from all possible initial values of Z, U p , and Q p , are started. 

(iv) For t = -2P + 1, . . . , -1,0, determine z% = F^~(R Zi t \ Y) and z% = F^~{R Zi t \ Y) 
V i = 1, . . . , n. This step can be thought of as initializing the perfect sampler with all pos- 
sible values of Z and (lip, p ) at step — 2 J , and then moving on to the next forward step 
following generation of Z independently of the previous step, using the above random num- 
bers and optimized distribution functions. Generation of (H p , Q p ) is not necessary because 
of the sandwiching relation zf t < z it < zf t , which holds for any (U p , Q p ), and because co- 
alescence of zf t and i = 1, . . . , n for some t < guarantees coalescence of all chains 
corresponding to (il p , Q p ). 

(v) If zf v = zY t * V i = 1, . . . , n and for some t* < 0, then run the following Gibbs sampling 
steps from t = t* to t = 0: 

(a) Let Z* = (zl, . . . , z*)' denote the coalesced value of Z at time t*. Given Z*, draw 



(II* , 6*) from the full conditionals ( 1 1 ), (M) and ( 10 ) in order, using the corresponding 
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random numbers already generated. Thus, (Z*, II*, 6*) is the coalesced value of the 
unknown quantities at t = t*. Importantly, it is not straightforward to sample from 
full conditionals of non-conjugate distributions and/or in the case of compact param- 
eter spaces. In such situations we recommend rejection sampling/adaptive rejection 



sampling; see Section 3.5 for details. 

(b) Carry forward the above Gibbs sampling chain started at t = t* till t = 0, simulating 
sequentially from ([8]), (11), (|9]) and ( flO] ). Then, the output of the Gibbs sampler ob- 
tained at t = 0, which we denote by (Z , U p0 , Q p o), is a perfect sample from the true 
target posterior distribution. 



Note that here optimization is required only once, in Step (i) of Algorithm |3.1| By reducing the 
gaps between the bounding chains, the algorithm can be made further efficient. Such techniques 



are discussed in Section 3.4 in conjunction with Section S-l 1 In fact, we present a variant of the 



above algorithm for two-component mixtures in Algorithm S- 1 1 . 1 where optimization with respect 
to the mixture component probability is not required. That algorithm exploits a monotonicity 
structure which is not enjoyed by mixtures having more than two components. Indeed, even though 



Algorithm 3 . 1 is applicable for mixtures with any known number of components, Algorithm S- 1 1 . 1 



is applicable only to two-component mixtures. 



It is interesting to note that we need to run just two chains (12) and (13) and check their 
coalescence; there is no need to simulate (lip, Q p ) before coalescence occurs with respect to Z in 
these two bounding chains, even in non-conjugate cases. This property of our methodology has 



some important advantages which are detailed in Section 3.6 



It is proved in Section S-8 that coalescence of Z occurs almost surely in finite time. Foss & 
Tweedie (1998) showed that coalescence occurs in finite time if and only if the underlying Markov 



chain is uniformly ergodic. In Section S-9 we show that our Gibbs sampler, which first updates 
Z, is uniformly ergodic, which is expected thanks to the compact parameter space. The proofs in 



Sections [S^7] , |S-8| and |S-9| go through with the modified bounds needed for mixtures with unknown 
number of components. 
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In conjugate setups further simplification results since we only need to simulate perfectly from 
the posterior of Z\ once a perfect sample of Z is obtained, simulations from (11 ), ([9]), and ( fTO] ) 



ensures exact samples from the posteriors of (H p , Q p ) as well. In order to simulate from the poste- 
rior of Z, we can integrate out (H p , Q p ) from the full conditional of and construct the bounding 
chains with respect to the marginalized distribution function of Z{, optimizing the marginalized 
distribution function with respect to 

3.4 Efficiency of the bounding chains 



It is an important question to ask if the lower bound (12) can be made larger or if the upper 



bound ( 13 ) can be made smaller, to accelerate coalescence. This can be achieved if a monotonicity 



structure can be identified in (n p ,O p ). In Section S-ll we illustrate this with an example. In 



Section 4.5 we propose a method for reducing the gaps between the bounds in mixture models 
with unknown number of components. There it is also discussed that for these models, more 
information in the data can further reduce the gap between the bounding chains. 

3.5 Restricted parameter space and rejection sampling after coalescence 

If our algorithm coalesces at time t < 0, then Gibbs sampling is necessary from that point on till 
time t — 0. The bounds, however, may prevent exact simulation from the full conditionals of Q p 
using conventional methods, such as the Box-Muller transformation (Box & Muller (1958)) in the 
case of normal full conditionals, which becomes truncated normal under the restrictions. In these 
situations, rejection sampling may be used. Briefly, let {R* t ; r = 1, 2, . . .} denote a collection of 
infinite random numbers, to be used sequentially for rejection sampling of the continuous random 
variables at time t by the full conditionals of the continuous random variables. Actual simulation 
using rejection sampling is not necessary until Z coalesces. In the case of non-conjugate priors 
(perhaps, in addition to restricted parameter space), the full conditional densities are often log- 
concave. In such situations the same principle can be used, but with rejection sampling replaced 
by adaptive rejection sampling (Gilks & Wild (1992), Gilks (1992)). 
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3.6 Advantages of our approach 

Our bounding chain approach for only the discrete components Z has several advantages over the 
previous approaches. Firstly, simulation of the continuous parameters before coalescence of Z, 
is unnecessary. This advantage is important because construction of bounds for the continuous 
parameters, even if possible, may not be useful since the coalescence probability of continuous 
parameters corresponding to the bounding chains, is zero. Moreover, bounding the distribution 
functions of continuous parameters in the mixture model context does not seem to be straightfor- 
ward without discretization. Another advantage of our perfect sampling principle is that we do not 
need a partial order of the multi-dimensional state space and it is unnecessary to find minimal and 
maximal elements to serve as initial values of the bounding chains. Indeed, our bounding chains 
begin with simulations from F^(- \ Y) and F 1 C7 (- | Y), which do not require any initial values. 
Also, importantly, our approach of creating bounds for Z does not depend upon the assumption 
of conjugate priors. Exactly the same approach will be used in the case of non-conjugate priors. 
After coalescence, regardless of compact support or non-conjugate priors, Gibbs sampling can be 
carried out in a very straightforward manner till time t = 0. 

3.7 Obtaining infimum and supremum of | Y, X_j) in practice 

For each j = 1, . . . , p, and for alH = 1, . . . , n, the bounds F t L (j | Y) and Ff(j | Y) are bounded 
away from and 1 but not always easily available in closed forms. Numerical optimization using 
simulated annealing (see, for example, Robert & Casella (2004) and the references therein) with 
temperature T oc log (i +t ) , where t is the iteration number, turned out to be very effective in our 
case. This is because the method, when properly tuned, can be quite accurate, and it is entirely 
straightforward to handle constraints (introduced through the restricted parameter space in our 
methodology) with simulated annealing through the acceptance-rejection steps as in Metropolis- 
Hastings algorithm. At each time t a set of fixed random numbers will be used for implementation 
of simulated annealing within our perfect sampling methodology. 

Interestingly, for our perfect sampling algorithm we do not need simulated annealing to be 
arbitrarily accurate; given random numbers {Rz,u t = 1, 2, . . .} we only need it to be accurate 
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enough to generate the same realization from the approximated distribution functions as obtained 
had we used the exact solution. For instance, assume that F t L (j — 1 | Y) < R Zi)t < F t L (j \ Y), 
implying that zf t = j. Letting F t L denote the approximated distribution function, we only need 
the approximation to satisfy F^{j — 1 | V) < R Zl ,t < F t L (j \ Y) so that z\ t = j even under the 
approximation. This is achievable even if arbitrarily accurate approximation is not obtained. Since 
in general there seems to be no way to check the error of approximation by simulated annealing, 
we propose the following method. Instead of a single run of simulated annealing with a fixed run 
length, one may give a few more runs, in each case increasing the length of the run by a moderately 
large integer, and obtain z\ t and zf t in each case. Since the random numbers R z t are fixed (in fact, 
we recommend fixing the random numbers used for simulated annealing as well), the values of zf t 
and zf t will become constants as the run length increases for simulated annealing. These constants 
must be the same as would have been obtained had the optimization been exact. This strategy is 
not excessively burdensome computationally, since there is no need to carry out each run afresh; 
the output of the last iteration of the current run of simulated annealing will be taken as the initial 
value for the next run. 

Our perfect sampling methodology is illustrated in a 2-component normal mixture example 
in Section S-l 1[ here we simply note that our method worked excellently. A further experiment 



associated with the same example, and reported in Section S-11.4[ showed that perfect sampling 



based on simulated annealing yielded results exactly the same as those obtained by perfect sam- 
pling based on exact optimization, in 100% of 100,000 cases. The outcome of the latter experiment 
clearly encourages the use of simulated annealing for optimization in perfect sampling. 

We now extend our perfect sampling methodology to mixtures with unknown number of com- 
ponents, which is a variable-dimensional problem. In this context, the non-parametric approach 
of Escobar & West (1995) and the reversible jump MCMC (RJMCMC) approach of Richardson 
& Green (1997) are pioneering. The former uses Dirichlet process (see, for example, Ferguson 
(1974)) to implicitly induce variability in the number of components, while maintaining a fixed- 
dimensional framework, while the latter directly treats the number of components as unknown, 
dealing directly, in the process, with a variable dimensional framework. The complexities involved 
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with the latter framework makes it difficult to extend our perfect sampling methodology to the case 
of RJMCMC. A new, flexible mixture model based on Dirichlet process has been introduced by 
Bhattacharya (2008) (henceforth, SB), which is shown by Mukhopadhyay, Bhattacharya & Dihidar 
(2011) (see also Mukhopadhyay, Roy & Bhattacharya (2012)) to include Escobar & West (1995) 
as a special case, and is much more efficient and computationally cheap compared to the latter. 
Hence, we develop a perfect sampling methodology for the model of SB, which automatically 
applies to Escobar & West (1995). 

4. PERFECT SAMPLING FOR NORMAL MIXTURES WITH UNKNOWN NUMBER OF 

COMPONENTS 

As before, let Y = (yi, . . . , y n )' denote the available data set. SB considers the following model 

[Vi I Qm] ~ £ V t 6XP - 2 {m ~ Nf (14) 
j= i < J 

In the above, M is the maximum number of components the mixture can possibly have, and is 



known; Q M = {9i, 2 > • • • , &m} with 9j = Qij, Xj), where Xj = a . We further assume that 



M 



are samples drawn from a Dirichlet process: 



9j ~ G 



G ~ DP(aGo) (15) 
Usually a Gamma prior is assigned to the scale parameter a. 



Under the mean distribution G in ( 15 ), 



rid 



Xj ~ Gamma ( — , — ] (16) 



~ Nifio^Xj 1 ) (17) 
Under the Dirichlet process assumption the parameters 9j are coincident with positive proba- 



bility; because of this ( 14) reduces to the form 



[yi I Qm] = E^V ^ ex P {-f (w - ' (18) 
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where {#*, . . . , 6*} are p distinct components in 6m with 9* occurring Mj times, and ttj = Mj/M. 



Using allocation variables Z = (z±, . . . , z n )', SB's model can be represented as follows: For 
i — \,...,n and j — 1, . . . , M, 



As is easily seen and is argued in Mukhopadhyay et al. (2012), setting M = n and Zi — % for 
i — 1, . . . , M(= n), that is, treating Z — (1,2,..., n)' as non-random, yields the Dirichlet process 
mixture model of Escobar & West (1995). 

However, unlike the case of mixtures with fixed number of components, the full conditionals of 
only Z and <d M can not be used to construct an efficient perfect sampling algorithm in the case of 
unknown number of components. This is because the full conditional of 9j given the rest depends 
upon Z as well as 0_jM> which implies that even if Z coalesces, 9j can not coalesce unless 0_jm 
also coalesces. But this has very little probability of happening in one step. Of more concern 
is the fact that Z may again become non-coalescent if M does not coalesce immediately after Z 
coalesces. Hence, although the algorithm will ultimately converge, it may take too many iterations. 
This problem can be bypassed by considering the reparameterized version of the model, based on 
the distinct elements of Q M and the configuration indicators. 

4. 1 Reparameterization using configuration indicators and associated full conditionals 
As before we define the set of allocation variables Z — (z-y,..., z n )' , where z% = j if is from the 
j-th component. Letting Q* M = {#*, . . . , 9* k } denote the distinct components in 0m, the element 
Cj of the configuration vector C = (ci, . . . , cm)' is defined as Cj = £ if and only if 9j = 9\\ 
j = 1, . . . , M, £ = 1, . . . , k. Thus, (Z, 0m) is reparameterized to (Z, C, k, Q* M ), k denoting the 
number of distinct components in M - 

The full conditional distribution of Zi is given by 




(19) 



1 



M 



(20) 




(21) 
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Since 6a/ can be obtained from C and Q* M , we represented the right hand side of pTj ) in terms of 

e M . 

To obtain the full conditional of Cj, first let kj denote the number of distinct values in Q-jM, 
and let Q\ ; £ = 1, . . . , kj denote the distinct values. Also suppose that Q\ occurs My times. 
Then the conditional distribution of Cj is given by 

nq* tj if £ = l,...,kj 
nq j if £ = kj + 1 



where 



Y, Z, C-j, kj,Q A 



Ml 



CM 



a 



( ) 



r(i; 



rijip + 1 



1 

2vr 



?7 + n„- 5 



Mi 



tj 



(2< 



exp 



X- 3 



^ \2 



(22) 



(23) 



(24) 



In (22), (23), and (24), k is the normalizing constant, rij = : z f = j} and yj = J2i- Zi =j Vil n r 



Note that q j is the normalizing constant of the distribution Gj defined by the following: 



[A, 



fas I A i 



rj + rtj 1 



Gamma I - '_ ' 3 , - < ( 1 



- ^o) 



N 



2 ' 2 

njjjjip + no ip 



Tljlp + 



i:Zi=j 



rijijj + 1 ' \j(njijj + 1) 
The conditional posterior distribution of Q\ is given by 



(25) 
(26) 



[6\ | Y, Z, C] ~ Gamma (A* : r/|, £) x JV : /4, V/AJ" ) 



(27) 



where 



j:Cj=£ j-Cj=£ j'-Cj=l 

/4 = ^ n \fi + fMi) I {^n* e + 1) , 

= v/(^* + i), 



n* e +r] 



(28) 

(29) 
(30) 
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and 

G = I l c + " ; i»°+? 2 + E "'<« " *»' + £E - sA • (3D 

^ £ j-Cj=l. j:c j =ei:z i =j J 

It is to be noted that the 9\ are conditionally independent. 

For Gibbs sampling, we first update Z, followed by updating C and the number of distinct 
components k, and finally {9};£ — 1, . . . , k}. 

4.2 Non-conjugate G 

In the case of non-conjugate G (which may have the same density form as a conjugate prior but 
with compact support), g j is not available in closed form. We then modify our Gibbs sampling 
strategy by bringing in auxiliary variables in a way similar to that of Algorithm 8 in Neal (2000). 
To clarify, let 9 a = (/i a , A a ) denote an auxiliary variable (the superscript "a" stands for auxiliary). 
Then, before updating Cj we first simulate from the full conditional distribution of 9 a given the 
current Cj and the rest of the variables as follows: if Cj = Q for some £ ^ j, then 6 a ~ Gq. If, 
on the other hand, Cj ^ c^Vl ^ j, then we set 9 a = 9* c .. Once 9 a is obtained we then replace the 
intractable q j with the tractable expression 



q) = a — — fn-exp 



(2vr; 



m - Hi 1 ' 

Once Cj is simulated, if it is observed that 6>, ^ 9 a Vj, then 9 a is discarded. 



(32) 



4.3 Relabeling C 



Simulation of C by successively simulating from the full conditional distributions (22) incurs a 
labeling problem. For instance, it is possible that all c, are equal even though each of them cor- 
responds to a distinct 9j. For an example, suppose that Q* M consists of M distinct elements, and 
Cj = M Vj. Then although there are actually M distinct components, one ends up obtaining 
just one distinct component. For perfect sampling we create a labeling method which relabels C 
such that the relabeled version, which we denote by S = (si, . . . , sm)', coalesces if C coalesces. 



To construct S we first simulate Cj from (22 1; if Cj G {1, . . . , kj}, then we set 9j = 9*. and if 
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Cj = kj + 1, we draw 6j = 6*. ~ Gj. The elements of 5* are obtained from the following definition 



of Sj~. Sj = £ if and only if 8j = 9}. Note that si = 1 and 1 < Sj < Sj-i + 1. In Section S-10 
it is proved that coalescence of C implies the coalescence of S, irrespective of the value of Q* M 
associated with C. 

4.4 Full conditionals using S 

With the introduction of S it is now required to modify some of the full conditionals of the unknown 
random variables, in addition to introduction of the full conditional distribution of S. The form 



of the full conditional [zi \ Y, S, k, &* M ] remains the same as (21 ), but <d M involved in the right 



hand side of (21 ) is now obtained from S and 0^. The modified full conditional of Cj, which 



we denote by [cj \ Y, Z, S-j, kj, Q* M ], now depends upon S-j, rather than C-j, the notation being 



clear from the context. The form of this full conditional remains the same as ( |22| ) but now the 
distinct components Q\ ; £ = 1, . . . , kj are associated with the corresponding components of S 
rather than C. The form of the modified full conditional distribution of 6* p , which we now denote 



by [9 1 | Y, Z, S, k] , remains the same as ( 27 ), but in equations ( 28 ) to ( 3 1 ), C must be replaced by 
S. In the above full conditionals, k and kj are now assumed to be associated with S. 

The conditional posterior [S | Y, C, ©a/] gives point mass to S*, where S* = (si, ... , s* M )' is 



the relabeling obtained from C and Qm following the method described in Section 4.3 For the 
construction of bounds, the individual full conditionals [sj \ Y, S-j,C, Qm], giving full mass to s*, 
will be considered due to the convenience of dealing with distribution functions of one variable. It 
follows that once Z and C coalesces, S and Q* M must also coalesce. In the next section we describe 
how to construct efficient bounding chains for Z, C and S. Bounding chains for S are not strictly 
necessary as it is possible to optimize the bounds for Z and C with respect to S, but the efficiency 
of the other bounding chains is improved, leading to an improved perfect sampling algorithm, if 
we also construct bounding chains for S. 
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4.5 Bounding chains 

As in the case of mixtures with known number of components, here also the idea of construct- 
ing bounding chains is associated with distribution functions of the discrete random variates, but 
here the bounding chains can be made efficient by fixing the already coalesced individual discrete 
variates while taking the supremum and the infimum of the distribution functions. Moreover, for 
informative data, the full conditional distributions of c, (hence, of Sj) will be similar given any 
values of the conditioned variables; thus the difference between the supremum and the infimum 
of their distribution functions are expected to be small. This particular heuristic is reflected in 
the results of the application of our methodology to three real data sets in Section [5] Also, as 



noted in Section 3.3[ even in the case of unknown number of components, 0^ can be analytically 



marginalized out in conjugate cases, simplifying optimization procedures. The full conditional 
distributions associated with our model, marginalized over Q* M in a conjugate case are provided in 
Mukhopadhyay et al. (2011). 

4.5.1. Bounds for Z Let F Zi (- \ Y, S, k, Q* M ) denote the distribution function of the full con- 
ditional of zu and let F c . (■ | Y, S-j, kj, Q* M ), F Sj (■ | Y, S-j, C, Qm) stand for those of Cj and Sj, 
respectively. Also assume that — oo < Mi < fij < M 2 < oo and < M 3 < \j < M 4 < oo, for 
all j. 

Let S denote the set consisting of only those Sj that have coalesced, and let S~ = S\S consist 
of the remaining Sj. Then 



F% (■ | Y,S) = irif F Zi (- | Y, S, S~ , k, Q* M ) (33) 
FV(-\Y,S) = sup F Zi (-\Y,S,S-,k,Q* M ) (34) 



s-,k,e* 



Clearly, fixing S helps reduce the gap between (33) and (34). The infimum and the supremum 



above can be calculated by simulated annealing. For the proposal mechanism needed for simulated 
annealing with S held fixed, we selected Sj E S~ uniformly from {1, . . . , Sj_i + 1}, where sj_! 
either belongs to S or has been selected uniformly from {1, . . . , Sj_ 2 + !}• Once S is proposed in 
this way, this determines k automatically. We then propose 6*,. . . ,9* k using normal random walk 
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proposals with approximately optimized variance. 



4.5.2. Bounds for C Let Z denote the set of coalesced Zi, and let Z = Z\Z consist of those 
Zj that did not yet coalesce. Then 

F C L .(-\Y,S,Z) = inf f F c .(-|y,£,S-,%,Z,Z-,e^) (35) 

S ,kj,Z ,© M 

F%(-\Y,S,Z) = sup ^ F Cj (-\Y,S,S-,kj,Z,Z-,Q* M ) (36) 

S ,kj,Z ,Q* M 

Note that for S — 0, the supremum corresponds to fc, = 1 and the infimum corresponds to kj = 
M — 1. If 5 7^ 0, the supremum is associated with kj = #S\{sj}, the number of distinct 
components of and the infimum corresponds to the case where kj = # (S U S* - ) 

when all elements of S~ are distinct. Thus, proposal mechanism of S~ for simulated annealing 
is not necessary; manually setting all elements of S~ to be equal for obtaining the supremum and 
manually setting all elements of S~ to be distinct for obtaining the infimum are sufficient, since the 
actual values of the elements of S are unimportant. This strategy dictates the number of distinct 
elements of 9m, and their positions. The proposal mechanism for 6^ may be chosen to be the 
same as that used for obtaining the bounds for while the elements of Z~ may be proposed by 
drawing uniformly from {1, . . . , M}. 

4.5.3. Bounds for S Letting C and C~ = C\C denote the sets of coalesced and the non- 
coalesced cj, the lower and the upper bounds for the distribution function of Sj are 

F^(-\Y,C) = inf F S] (.\Y,C,C-,Q* M ) (37) 
F?(.\Y,C) = sup F 8 .(.\Y,C,C-,e* M ) (38) 

For simplicity let us denote F Sj (- \ Y,C,C~ ,<d* M ) by F Sj (-) suppressing the conditioned vari- 
ables. Since, given C and Q* M , S is uniquely determined, F Sj (k) = or 1, for A: = 1, . . . , M. 
Thus, optimization of F Sj (k) needs to be carried out extremely carefully because either the correct 
optimum or the incorrect optimum will be obtained, leaving no scope for approximation. How- 
ever, simulated annealing is unlikely to perform adequately in this situation. For instance, while 
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maximizing, a long sequence of iterations yielding F s .(k) = does not imply that 1 is not the 
maximum. Similarly, a long sequence of l's while minimizing may mislead one to believe that 
1 is the minimum. In other words, the algorithm does not exhibit gradual move towards the opti- 
mum, making convergence assessment very difficult. So, we propose to construct functions hj(-) 
of F Sj (-)'s and appropriate auxiliary variables such that the optimization of F Sj (■) is embedded in 
the optimization of hj(-), while avoiding the aforementioned problems by allowing gradual move 
towards the optimum. Details are provided below. 

A more convenient optimizing function We construct hj(-) as follows: 

W (i) • a;) i 



i=l 



W ) = 5>^^^ (39) 



where W = (wi, . . . , w M ) denotes the vector of weights, F = (F Sj (l), . . . ,F Sj (M)) and J2jLi 



w j 



1 with Wj > 0, Vj. Clearly, < hj(-) < 1. We represent Wj as Wj = ^/ - , where rij > 0. We 



use simulated annealing to optimize (39) with respect to (W, C~ , <d* M ) but let — > oo with the 
iteration number while simulating other rii, i ^ k randomly from some bounded interval. This 
leads to optimization of F Sj (k), while avoiding the problems of naive simulated annealing. In our 
examples we took oc log(l + t), where t is the iteration number. 

Optimizing strategy Since S is just a relabeled version of C, the distribution functions of the 
full conditionals of Cj and Sj are optimized by the same Q M , provided that none of Sj coalesced 
during optimization in the case of C . All that the proposal mechanism requires then is to simulate 
Cj G C~ uniformly from {1, . . . , M}. If C (= C U C~) and M do not lead to a valid S, then 
the proposal is to be rejected, remaining at the current C~, else the acceptance-rejection step of 
simulated annealing is to be implemented. If, on the other hand, some Sj had coalesced during 
optimization in cj, the optimizer in the case of Sj is expected to be a slight modification of that 
in the case of Cj. We construct the modification as follows. If C, simulated from the bounding 



chains (35) and (36) in the previous step, is not compatible with 6m, then we augment Q* M with 



new components drawn uniformly: fj, ~ U (Mi, M 2 ) and A ~ U (M 3 , M 4 ), in such a manner that 
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compatibility is ensured. We then use the adjusted set of 9m for rest of the annealing steps. This 
scheme worked adequately in all our experiments. Note that if entire C coalesces, then for all j 

and for any Q M associated with C, F S L (■ \Y,C) = (• \Y,C) = F, (• | Y,C,Q M ), which 



implies coalescence of S (recall the discussion in Section 4.4). 



The proof presented in Section S-7 goes through to show that the bounds of the distribution 
functions of (Z, C, S), which are obtained by optimizing the original functions treating the coa- 
lesced random variates as fixed, are also distribution functions. The proof remains valid even if 
the original distribution functions of the discrete variates are optimized with respect to the scale 
parameter a and other hyper-parameters. Optimization with respect to the latter is necessary if 
a and the hyper-parameters are treated as unknowns and must be simulated perfectly, likewise as 
6a/ • Assuming that the original Gibbs sampling algorithm is updated by first updating Z, then 
C, followed by S, and finally Q* M , the proof of coalescence of the random variables in finite time 



is exactly as that provided in Section S-8 The proof of uniform ergodicity presented in Section 



S-9 applies with minor modifications in the current mixture problem with unknown number of 
components. 

Below we provide an algorithmic representation of perfect sampling in mixtures with unknown 
number of components. 

Algorithm 4.1 CFTP for mixtures with unknown number of components 

(1) For j = 1 . . ., until coalescence of (Z, C), repeat steps (2) and (3) below. 

(2) Define Sj = {-2 j + 1, . . . , -2 5 ' -1 } for j > 2, and let Si = {-1,0}. For each m G Sj, 
generate random numbers Rz, m , Rc,m, Rs,m and Ro M , m (again, we shall let these random 
numbers stand for realizations from the uniform distribution on (0, 1)), meant for simulating 
Z, C, S, and Q M respectively. Although random numbers are not necessary for simulating 
S from its optimized full conditionals because of degeneracy, Rs, m will still be used for 
optimizing its distribution function using simulated annealing. The random numbers Re M , m 
will correspond to M distinct components, so that the same set will suffice for smaller num- 
bers of distinct components in the set Qm where all components need not be distinct. The 
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distinct components of Qm are meant to be simulated (but recall that actual simulation is not 
necessary until the coalescence of (Z, C, S)) using those random numbers in the set Re M ,m, 
which correspond to their positions in 6m- 

Once generated, treat the random numbers as fixed thereafter for all iterations. As in Algo- 



rithm 3.1 at step — 2 3 no random number generation is required. 
(3) For t = -2 j + 1, . . . , -1, 0, implement steps (3) (i), (3) (ii) and (3) (iii): 
(i) For % — 1, . . . , n, 

(a) For t = 1, . . . , M, calculate F^{i | Y, S) and F% {£ \Y,S), using the simulated 



annealing method described in Section 4.5.1 
(b) Determine 4 = F^-(R zi)t | Y, S) and z% = F£-(R zi;t \Y,S). As in Algorithm 
3.1 this step can be thought of as initializing the perfect sampler with all possible 
values of (Z, C, S, k, Q* M ) at step — 2 j ; at step — 2 j + 1,5 = (omitting the always 
coalescent s± = 1), signifying that simulations at this forward step is independent 
of the previous step — 2 J . From step — 2 J + 2 onwards, there is positive probability 
that 5^0. Thus, with positive probability, the bounding chains for Z{ will be 
more efficient from this point on. 

(ii) For % = 1, . . . ,M, 

(a) For i = 1, . . . , hi + 1, calculate F^{1 \ Y, S, Z) and F^ (£ | Y, S, Z), using the 



simulated annealing technique described in Section 4.5.2 Recall that the supre- 
mum corresponds to ki = #S\{si}, when S~ contains a single distinct element, 
and the infimum corresponds to the case where ki = # (S U S~) \{si}, when all 
elements of S~ are distinct, and so the set S~ will be set manually to have a single 
distinct element or all distinct elements. 

(b) Set cfi = FV-{R^ t | Y, S, Z) and 4 = F^(R Ci t \ Y, S, Z). 
(iii) Forz = 1, . . . , M, 
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(a) For i = 1,...,M, calculate F£(£ \ Y, C) and {£ \ Y,C), using the methods 
described in Section l4.5.3l 

(b) Since, for some t 6 {1, . . . , M}, F^[£ \ Y, C) = for £ < t and 1 for t > £*, it 
follows that sf t = £*. Similarly, sf t can be determined. 

(4) If, for some t* < 0, z^, = zf r Vi = 1, . . . , n, and cf r = cf t . Vi = 1, . . . , M, then run the 
following Gibbs sampling steps from t = t* to t = 0: 

(a) Let Z* = (z{, . . . , z*)' and C* = (c*, . . . , c* M )' denote the coalesced values of Z and 
C respectively, at time t*. Given (Z*, C*), arbitrarily choose any value of Qm which 
is compatible with C* (one way to ensure compatibility is to choose any 0m having 
M distinct elements); then obtain S* from [S | Y, C, Qm] using the algorithm given in 



Section 4.3 Finally, obtain Q* M from its full conditional distribution, using the random 



numbers already generated. As in Algorithm |3.1[ here also rejection sampling/adaptive 
rejection sampling may be necessary for obtaining Q* M . This yields the coalesced value 

(Z*,C*,S*,e* M ) at time t = t*. 

(b) Using the random numbers already generated, carry forward the above Gibbs sampling 
chain started at t = t* till t = 0, simulating, in order, from the full conditionals of 



(Z,C,S,Q* M ), provided in Sections 4.1 4.2 4.3 and 4.4 Then, the output of the 



Gibbs sampler obtained at t = 0, which we denote by (Z , C , So, @mo)' 1S a P er fect 
sample from the true target posterior distribution. 

4.6 Illustration of perfect simulation in a mixture with maximum two components 

We illustrate our new methodologies in the framework of the mixture model of SB assuming M = 

2. In other words, we consider the model 

1 2 

foi|e 2 ]~ -^iV^/^AT 1 ) (40) 

3=1 

We further assume that A x = A 2 = A, where A is assumed to be known. Hence, 2 = 9 2 ), 
where 9j = j = 1, 2. As in the case of the two-component mixture example detailed in Section 
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S- 1 1 , here also we consider a simplified model for convenience of illustration and to validate the 
reliability of simulated annealing as the optimizing method in our case. 
We specify the prior of /ij as follows: 

fij ~ G, j = 1,2 
G ~ V(aG ), 

(41) 

and fij ~ N(fi , ipX" 1 ) under G . 



We draw 3 observations yi, y 2 , y3, from (j40j) after fixing fi\ = 2.19, /i 2 = 2.73 and A = 20. We 
chose a = 1, /i = 1.98, and ip = 2.33 (the latter two are drawn from normal and inverse gamma 
distributions). Using a pilot Gibbs sampling run we set 0.45 = M\ < /ii, /x 2 < M 2 = 3.7. 

4.6.1. Optimizer for bounding the distribution function of Zi The exact minimizer and the 
maximizer of the distribution function of Zi with respect to 2 or the reparameterized variables 
(S : 62) are of the form (a, b) where each of a and b can take the values y, t , Mi or M 2 . Evaluation of 
the distribution function at these points yields the desired minimum and the maximum at different 
time points t. 

4.6.2. Optimizer for bounding the distribution function ofcj For Cj, the optimizer with respect 
to 2 is given by (a, b) where a and b can take the values fjj, M\ and M 2 . Of course, this is the same 
as what would be obtained by optimizing with respect to the reparameterized version (S, Q>1). As 
before, evaluation of the distribution function at these points is necessary for obtaining the desired 
optimizer. In this case, the optimizer with respect to Z is obtained by considering all possible 
values of Z = (z±, z 2 , z 3 )'. 

4.6.3. Optimizer for bounding the distribution function of Sj No explicit optimization is nec- 
essary to obtain the bounds for Sj, as S = (si, s 2 ) is completely determined by C obtained from 
its corresponding bounding chains. Note that for the four possible values of C — (ci, c 2 ): (1, 1), 
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Hi H2 

Figure 1: Posterior densities of /ii and /i2 using samples obtained from perfect simulation (red 
curve) and independent runs of Gibbs sampling (black curve). The blue curve stands for the pos- 
teriors corresponding to the unbounded support. 

(1, 2), (2, 1), (2, 2), the corresponding values of S = (s h s 2 ) are (1, 1), (1, 2), (1, 1) and (1, 2), 
respectively. 

4.6.4. Results of perfect sampling Results of 100,000 iid perfect samples are displayed in 
Figure [TJ the results are compared with 100, 000 independent Gibbs sampling runs, each time 
discarding the samples obtained in the first 10, 000 Gibbs sampling iterations and retaining only the 
sample in the 10, 001-th iteration. The figure shows that the posterior distributions corresponding 
to perfect sampling (red curve), Gibbs sampling (black curve) in the case of compact supports, as 
well as the posterior corresponding to unbounded support (again, based on 100,000 Gibbs samples, 
each with a burn-in of length 10,000), displayed by the blue curve, agree with each other very 
closely. This is very encouraging, and validates our perfect sampling methodology. 

It is important to remark in this context of validation that a reviewer had expressed concern that 
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number of steps to time t-0 after coalescence 



Figure 2: Probability distribution of the number of steps taken from the time of coalescence till the 
time t = corresponding to Figure[TJ associated with 100,000 iid perfect samples. 
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the period between the time of coalescence till the time t = in our perfect sampling algorithm 
may be long enough to be regarded as a burn-in period, thus rendering our iid perfect samples 
anyway similar to the iid Gibbs samples, thus decreasing the strength of the validation exercise. 
We assure the reader, however, that this is not the case. Indeed, the effective maximum time period 
between the coalsescence time and the time t = in the probability distribution of the period 
between the coalescence time and the time t — 0, displayed in Figure [2| in all 100,000 cases is 
254, which also has very small probability of occurrence (only 35 out of 100,000). The number of 
occurrences of the values between 255 and 506 (the latter being the actual maximum time period 
between the coalsescence time and the time t = 0) is either or 1 (mostly zero), in all 100,000 
cases. In fact, the modal time period between the coalescence time and the time t = is just 2, 
occurring 3582 times. In other words, in most cases it took just two steps to reach time t = after 
coalescence. As a result, the period between the coalescence time and the time t = is just too 
small to be any reasonable burn-in period, and is not comparable to the burn-in period of length 
10,000 of our Gibbs sampler. 

The above arguments show that the agreement between perfect sampling and Gibbs sampling 
in Figure [T] is due to the fact that both the algorithms are correct, the former being exact, and the 
latter being approximate, but very accurate thanks to the long burn-in of length 10,000. 

4.6.5. Validation of simulated annealing in this example As in the example with known number 
of components here also we validate simulated annealing by separately obtaining 100, 000 iid 
samples using our perfect sampling algorithm but using simulated annealing (with 7,000 iterations) 
to optimize the bounds for the distribution functions of (Z, C, S). We have used the same random 
numbers as used in the perfect sampling experiment for obtaining 100, 000 iid samples using the 
exact bounds. All the corresponding samples at time t = turned out to be the same, just as in 
the example of the mixture with exactly two components. This obviously encourages the use of 
simulated annealing in perfect sampling from mixtures with unknown number of components. 
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5 . APPLICATION OF PERFECT SIMULATION TO REAL DATA 
We now consider application of our perfect sampling methodology to three real data sets — Galaxy, 
Acidity, and Enzyme data. Both RG and SB used all the three data sets to illustrate their methodolo- 
gies. The Galaxy data set consists of 82 univariate observations on velocities of galaxies, diverging 
from our own galaxy. The second data concerns an acidity index measured in a sample of 155 lakes 
in north-central Wisconsin. The third data set concerns the distribution of enzymic activity in the 
blood, for an enzyme involved in the metabolism of carcinogenic substances, among a group of 
245 unrelated individuals. 

5. 1 Perfect sampling for Galaxy data 

5.1.1. Determination of appropriate ranges of the parameters We implemented a Gibbs sam- 
pler with M = 10, r\ = 4; £ = 1; // = 20; a a = 10; b a = 0.5; ip — 33.3; and obtained results 
quite similar to that reported in SB, who used M = 30. Using the results obtained in our experi- 
ments, we set the following bounds on the parameters: for j = 1, . . . , M(= 10), 9.5 < /i, < 34.5, 
0.01 < Xj < 5 and 0.08 < a < 35.5. The fit to the data obtained with this set up turned out to be 
similar to that obtained by SB. 

5.1.2. Computational issues We implemented our perfect sampling algorithm with the above- 
mentioned hyperparameter values and parameter ranges. Our experiments suggested that 500 sim- 
ulated annealing iterations for each optimization step are adequate, since further increasing the 
number of iterations did not significantly improve the optima. The terminal chains coalesced af- 
ter 32,768 steps. The reason for the coalescence of the bounding chains after a relatively large 
number of iterations may perhaps be attributed to the inadequate amount of information contained 
in the relatively sparse 82-point data set required to reduce the gap between the bounding chains 



(recall the discussion in Section 4.5). In fact, as it will be seen, perfect sampling with the other 
two data sets containing much more data points and showing comparatively much clear evidence 
of bimodality (particularly the Acidity data set) coalesced in much less number of steps. However, 
compared to the number of steps needed to achieve coalescence, the computation time needed to 
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implement the steps turned out to be more serious. In this Galaxy data, with M = 10, the compu- 
tation time taken by a workstation to implement 32,768 backward iterations turned out to be about 
11 days! We discuss in Section [6] that parallel computing is an effective way to drastically reduce 
computation time. However, we consider another experiment with M = 5 that took just 13 hours 
for implementation, yielding results very similar to those with M — 10. 

5.1.3. Results of implementation After coalescence, we ran the chain forward to time t = 0, 
thus obtaining a perfect sample. We then further generated 15,000 samples using the forward Gibbs 
sampler. The red curve in Figure [3] stands for the posterior predictive density, and the overlapped 
green curve is the the Gibbs sampling based posterior predictive density corresponding to the 
unbounded parameter space. The figure shows that the difference between the posterior predictive 
distributions with respect to bounded and unbounded parameter spaces are negligible, and can 
perhaps be attributed to Monte Carlo error only. The posterior probabilities of the number of 
distinct components being {1, . . . , 10} turned out to be {0, 0, 0.000067, 0.0014, 0.0098, 0.044133, 
0.1358, 0.265133, 0.3436, 0.200067}, respectively. 

5.1.4. Experiment to reduce computation time by setting M = 5 As a possible alternative to 
reduce computation time, we decided to further reduce the value of M to 5. The ranges of the 
parameters when M = 5 turned out to be somewhat larger compared to the case of M = 10: 
for j = 1,...,5, 9.5 < fij < 34.5, 0.01 < Xj < 20 and 0.08 < a < 100. Now the two 
terminal chains coalesced in 2048 steps taking about 13 hours. As before, once the terminal chains 
coalesced, we ran the chain forward to time t = 0, and then further generated 15,000 samples 
using the forward Gibbs sampler. The posterior predictive density is shown in Figure |4} As before, 
the figure shows that the differences between the posterior predictive densities with respect to 
bounded and unbounded parameter spaces are negligible enough to be attributed to Monte Carlo 
error. Moreover, when compared to Figure [3} Figure [4] indicates that the fitted DP-based mixture 
model with M = 5 is not much worse than that with M = 10. Here the posterior probabilities 
of the number of distinct components being {1, 2, 3, 4, 5}, respectively, are {0.000067, 0.001467, 
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Figure 3: Histogram of the Galaxy data and the posterior predictive density corresponding to 
perfect simulation with M = 10 (red curve). The green curve stands for the Gibbs sampling based 
posterior predictive density assuming unbounded parameter space. 




Figure 4: Histogram of the Galaxy data and the posterior predictive density corresponding to 
perfect simulation with M = 5 (red curve). The^reen curve stands for the Gibbs sampling based 
posterior predictive density assuming unbounded parameter space. 




Figure 5: Histogram of the Acidity data and the posterior predictive density corresponding to 
perfect simulation with M — 10 (red curve). The green curve stands for the Gibbs sampling based 
posterior predictive density assuming unbounded parameter space. 



0.026667,0.229733, 0.742067}. 



5.2 Perfect sampling for Acidity data 



Following the procedure detailed in Section 5.1 we set the following bounds: for j — 1, . . . , M(= 
10), 4 < fij < 6.9, 0.08 < Xj < 25, and 0.08 < a < 50. We implemented our perfect sampler 
with these ranges, and with hyperparameters i] = 4, £ = 0.7, /^o = 5.02, a a = 15, b a = 0.5, 
and if} = 33.3. As in the Galaxy data, here also 500 iterations of simulated annealing for each 
optimization step turned out to be sufficient. The terminal chains took about 4 hours to coalesce in 
128 steps. 

The posterior predictive distribution is shown in Figure [5} Again, as before, the figure demon- 
strates that the posterior predictive density remains virtually unchanged whether or not the param- 
eter space is truncated. Figure [5] also indicates that the posterior predictive distribution matches 
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closely with that of the histogram of the data. The posterior probabilities of the number of distinct 
components being {1, . . . , 10} are {0, 0, 0.000067, 0.0024, 0.012, 0.0556, 0.159867, 0.303133, 
0.323067, 0.143867}, respectively. 

5.3 Perfect sampling for Enzyme data 



Following the procedures detailed in Sections 5.1 and 5.2 we fix M = 10; the bounds on the 



parameters are: for j = 1, . . . , M(= 10), 0.15 < ft < 3, 0.08 < Xj < 150.5 and 0.08 < a < 50. 
The hyperparameters in this example are given by r] = 4; ( = 0.33; /io = 1.45; a a = 20; b a = 0.5 
and ip = 33.3. 

We implemented our perfect sampler with these specifications, along with 500 iterations of 
simulated annealing for each optimization step. The terminal chains coalesced in 2048 steps taking 
about 4 days. As to be expected from the previous applications, here also, as shown in Figure [6j 
truncation of the parameter space virtually makes no difference to the resulting posterior predictive 
density associated with unbounded parameter space. Good fit of the model to the data is also 
indicated. The posterior probabilities of the number of distinct components being {1, . . . , 10}, 
respectively, are {0, 0.000933, 0.012067, 0.0634, 0.179, 0.2782, 0.219867, 0.1454, 0.075333, 
0.0258}. 

6. SUMMARY, DISCUSSION AND FUTURE WORK 
We have proposed a novel perfect sampling methodology that works for mixtures where the 
number of components are either known or unknown, and the set-up is either conjugate or non- 
conjugate. We have first developed the method for mixtures with known number of components, 
then extended it to the more important case of mixtures with unknown number of components. 
Our methodology hinges upon exploiting the full conditional distributions of the discrete random 
variables of the problem, optimizing the corresponding distribution functions with respect to the 
conditioned random variables, obtaining upper and lower bounds of the corresponding Gibbs sam- 
plers. One particularly intriguing aspect of this strategy is perhaps the fact that even though perfect 
samples of continuous random variables will also be generated, simulation of the latter is not at 
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Figure 6: Histogram of the Enzyme data and the posterior predictive density corresponding to 
perfect simulation with M = 10 (red curve). The green curve stands for the Gibbs sampling based 
posterior predictive density assuming unbounded parameter space. 
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all required before coalescence of the discrete bounding chains. We have shown that the gaps be- 
tween the upper and the lower bounds of the Gibbs sampler can be narrowed, making way for fast 
coalescence. Further advantages over the existing perfect sampling procedures are also discussed 
in detail. It is also easy to see that our current methodology need not be confined to univariate data, 
and the same methodology goes through for handling multivariate instances. 

With simulation studies we have validated our methodology for mixtures with known, as well 
as with unknown, number of components. However, application to real data sets revealed sub- 
stantial computational burden, and obtaining a single perfect sample took several hours with our 
limited computational resources. Thus, even though the convergence (burn-in) issue is completely 
eliminated, obtaining iid realizations from the posteriors turned out to be infeasible. As discussed 



in Section 5.1 , the difficulties are likely to persist in problems where large values of the maximum 
number of components are plausible, and in sparse data sets. Computational challenges are also 
likely to appear in massive data sets, since then the number of allocation variables for perfect sam- 
pling will increase manyfold. In multivariate data sets too, the computation can be excessively 
burdensome — here the number of discrete simulations necessary remains the same as in the cor- 
responding univariate problem, but optimization with respect to the continuous variables may be 
computationally expensive because of increased dimensionality. In such situations, parallel com- 
puting can be of great help. Indeed, in a parallel computing environment the upper and lower 
bounding chains can be simulated in different parallel processors, which would greatly reduce 
the computation time. Moreover, quite importantly, iid simulations from the posteriors can also 
be carried out easily by simulating perfect samples independently in separate parallel processors. 
This can be done most efficiently by utilizing two processors for each perfect realization, so that, 
say, with 16 parallel processors 8 perfect iid realizations can be obtained in about half the time a 
single perfect realization is generated in a stand-alone machine. The parallel computing procedure 
can be repeated to obtain as many iid realizations as desired within a reasonable time. Increasing 
the number of parallel processors can obviously speed up this procedure many times, which would 
make implementation of our algorithm routine. Although we, the authors, have the expertise in 
parallel computing, we are yet to have access to parallel computing facilities, which is the reason 
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why we could not obtain perfect iid realizations in our real data experiments and could not experi- 
ment with large M or massive data. In the near future, however, such access is expected, and then 
it will be easier for us to elaborate on these computational issues. 
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Supplementary Material 

Throughout, we refer to our main manuscript as MB. 

S-7. PROOF THAT F L j AND F u i ARE DISTRIBUTION FUNCTIONS 
Letting X_j denote all unknown variables other than we need to show that for almost all X-i 
the following holds: 

(i) lim^-oo Ft{h) = lim^oo F?(h) = 0. 

(ii) lirn^if (h) = lim^ Ff(h) = 1. 

(iii) For any x x > x 2 , F^{xx) > F l L (x 2 ) and Fffa) > F l u (x 2 ). 

(iv) linwh Ft{h) = if (x) and lim^ + Ff{h) = F?(x). 

Proof: Let X_i denote all unknown variables other than z\. To prove (i), note that for all h < 1, 
Fi(h | X-i) = for almost all Hence, by g of MB and by definition, both Ff(h) and F?(h) 
are with probability 1. Hence, lini/^-oo Ff(h) = lim^-co F^(h) = almost surely. 

To prove (ii) note that for all h > p, Fi(h \ X-i) = 1 for almost all X-i. Hence, for h > p, 
Ff(h) = Ffih) = 1, that is, lim^oo if 0) = lim^oo F t u (h) = 1 for almost all X-i. 

To show (iii), let hi > h 2 . Then, since | X-i) is a distribution function satisfying mono- 
tonicity, it holds that if t> 2 ) = inf x _, Fi(h 2 \ X-i) < Fi(h 2 | X-i) < Fi(hi | X-i) for almost 
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allX_i. Hence, if (/i 2 ) < m£ x _ t Fi(hi | X^) = if (/n). Similarly, F?(hi) = sup x _. i^ 



X-i) > Fi(hi | X-i) > Fi(h 2 | X_i) for almost all Hence, if^/ii) > sup x _.Fi(/i 



2 

= if (/> 2 ). 

To prove (iv), first observe that due to the monotonicity property (iii), the following hold for 
any x: 

lim Ff(h) > if (x) (42) 

h— >x+ 

lim F?{h) > F t u (x) (43) 

h— >x+ 

Then observe that, due to discreteness, | X_i) is constant in the interval [x, x + 5) for some 
5 > 0. Since the supports of F t L , Ff and | for almost all X_ { are same, F- L and must 



also be constants in [x, x + 5). This implies that equality holds in (42 ) and (43 ). 
Hence, both F t L and Ff satisfy all the properties of distribution functions. 

Remark: The right continuity property formalized by (iv) not be true for continuous variables. 
Suppose X ~ {7(0, 9), 9 > 0. Here the distribution function is F(x \ 9) = f , < x < 9 < oo. 
But 

x 

lim sup — = lim 1 = 1 

z->0+ a 9 x->0+ 



and, 



x 

sup lim — = sup = 



As a consequence of the above problem, attempts to construct suitable stochastic bounds for the 
continuous parameters (JJ P , Q p ) may not be fruitful. In our case such problem does not arise since 
we only need to construct bounds for the discrete random variables to achieve our goal. 

S-8. PROOF OF VALIDITY OF OUR CFTP ALGORITHM 
Theorem: The terminal chains coalesce almost surely in finite time and the value obtained at time 
t = is a a realization from the target distribution. 

Proof: 



38 



Let denote the realization obtained at time t by inverting Ff ', that is, z\ t = Ff (R Zi ,t), 
where {R Zi ,t] i = 1, ■ ■ ■ ,n;t = 1, 2, . . .} is a common set of Uniform(0, 1) random numbers 
which are iid with respect to both i and t, used to simulate Z — (zi, . . . , z n )' at time t for Markov 
chains starting at all possible initial values. Similarly, let zf t = F t L (R Zi ,t)- Clearly, for any 
Zn = F~(R Ziit | X-i) started with any initial value and for any X-i, z\ t < z it < zf t for all i and t. 

For i — 1, . . . , n and for j — 1, 2, . . ., we denote by El the event 



Z i,-2j( ^ ) — z i,-2i( 2? ), 



which signifies that the terminal chains and hence the individual chains started at t = — 2 J will 
coalesce at t = — 2 J ' _1 . It is important to note that both F t L and F? are irreducible which has 
the consequence that the probability of Ej, P(Ef) > ei > 0, for some positive e*. Since, for 
fixed i, {El ;j = 1,2,...} depends only upon the random numbers {R Zi ,t\ t = — 2 J , . . . , — 2 J ~ 1 }, 
{E?;j = 1,2,. . .} are independent with respect to j. Moreover, for fixed j, E\ depends only 
upon the iid random numbers {R z . _ 2 j; i = 1, . . . , n}. Hence, {Ef; i = 1, . . . , n;j = 1, 2, . . .} are 
independent with respect to both i and j. 

Let e = min{ei, . . . , e n }. Then due to independence of \E{ ; i = 1, . . . , n}, it follows that for 
j — 1, 2, . . ., Ei = nf =1 Ef are independent, and 

P (E j ) > e n (44) 

The rest of the proof resembles the proof of Theorem 2 of Casella et al. (2001). In other words, 

T 

P(No coalescence after T iterations) < [J { 1 - P(E j ) } (45) 

= {(l-e n )f^0 as T^oo. (46) 

Thus, the probability of coalescence is 1. That the time to coalesce is almost surely finite follows 
from the Borel-Cantelli lemma, exactly as in Casella et al. (2001). 

The realization obtained at time t = after occurrence of the coalescence event Ej for some j 
yields Z = Z exactly from its marginal posterior distribution. Given this Z , drawing n p0 from 
the full conditional distribution ( 1 1 ) of MB and then drawing pO sequentially from ([9]) and ( 10 ) 
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of MB given Zq and n p0 , yields a realization (Z , n p0 , 6 P o) exactly from the target posterior. The 
proof of this exactness follows readily from the general proof (see, for example, Propp & Wilson 
(1996), Casella et al. (2001)) that if convergent Markov chains coalesce in a CFTP algorithm during 
time t < 0, then the realization obtained at time t = is exactly from the stationary distribution. 

S-9. UNIFORM ERGODICITY 
Let P(-, •) denote a Markov transition kernel where P(x, A) denotes transition from the state x to 
the set A e B, B being the associated Borel a-algebra. If we can show that for all x in the state 
space the following minorization holds: 

P(x,A) > eQ{A), AeB, 

for some < e < 1 and for some probability measure Q(-), then P(-, •) is uniformly ergodic. 
In our mixture model situation the Gibbs sampling transition kernel is 

> ( mf [z^ | n^- 1 ), y] \ [n« | z®, y] [e« | z« n«, r] (47) 



The infimum in inequality (47 ) is finite since both n? _1) and 9? _1) are bounded. 
Denoting the right hand side of inequality (47) by g{Z^\ Hp , Qp ), we put 



e = J2 I I 9(z, n p , e p )du p de p > o. (48) 

Since g(-) is bounded above by the Gibbs transition kernel which integrates to 1, it follows from 



(48) that < e < 1. Hence, identifying the density of the Q-measure as g(-)/e, the minorization 
condition required for establishment of uniform ergodicity of our Gibbs sampling chain is seen to 
hold. 



S-10. PROOF THAT COALESCENCE OF C IMPLIES THE COALESCENCE OF S 
Let C = (ci, . . . , cm)' be coalescent. For convenience of illustration assume that after simulating 
each Cj, followed by drawing 8j depending upon the simulated value of Cj, the entire set S is 
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obtained from the updated set of parameters 6m- Note that in practice, only Sj will be obtained 
immediately after updating Cj and 9j. Let S-j = {si, . . . , Sj-_i, Sj+i, . . . , %}. Then Cj+i = £ 
denotes the £-th distinct element of S-j. If {1, . . . , dj} are the distinct components in S-j, dj 
being the number of distinct components, and I < Sj, then Sj + i = £. On the other hand, if 
£ < Cj+i < dj + 1, then Sj+i = Sj + 1. 

Now note that s\ = 1, which is always coalescent. If c 2 > 1, then s 2 = 2, else s 2 = 1, for 
all Markov chains. Hence, s 2 is coalescent. If c 3 > s 2 , then s 3 = s 2 + 1, else s 3 = c 3 . Since s 2 
is coalescent, then so is s 3 . In general, if c j+1 > Sj, then s j+ i = Sj + 1, else = Cj +1 . Since 
Si , . . . , Sj are coalescent, hence so is Sj +1 , for j = 1, . . . , M — 1. In other words, S 1 must coalesce 
if C coalesces. 

S- 1 1 . ILLUSTRATION OF PERFECT SIMULATION WITH A TWO-COMPONENT 

NORMAL MIXTURE EXAMPLE 
For i — 1, . . . , n, data point yi has the following distribution: 

[ Vi | 7T, 2 ] ~ nN( yi ; ^ A^ 1 ) + (1 - vr)iV(^; /i 2 , A 2 1 ), (49) 

where, for the sake of simplicity in illustration, Ai and A 2 are assumed known. The reason for 
considering this simplified model is two-fold. Firstly, it is easy to explain complicated method- 
ological issues with a simple example. Secondly, the bounds of Z are available exactly in this 
two-component example; the results can then be compared in the same example with approximate 
bounds obtained by simulated annealing. This will validate the use of simulated annealing in our 
methodology. 

The prior of /i,; j = 1, 2, is assumed to be of the form ([5]) of MB. Fixing the true values at 
7r = 0.8, Hi = 2.19 and fi 2 = 2.73, we draw a sample of size n = 3 from a normal mixture where 
o\ = A^ 1 = 0.9, a\ = A 2 1 = 0.5 are considered known. The hyperparameters are set to the 
following values: n = 0.9, r 2 = 0.8, £i = 2.5 and £ 2 = 3.5. We illustrate our methodology in 
drawing samples exactly from the posterior [tc, jii, /i 2 | y\,y 2 , y 3 ]. 
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S- 1 1 . 1 Construction of bounding chains 

To obtain F t L and F^; i = 1,2,3, note that here we only need to minimize and maximize 

F(1 \ X_) = TTy/^expl-^-^) 2 } 

vrv^exp {-%( yi -^i) 2 } + (l-7r)A 2 exp{-f (^-/i 2 )2} 
with respect to yU 2 and ir. Based on a pilot Gibbs sampling run we obtain the following bounds 
for in and /i 2 : M x = 0.2 < /ii < 4.12 = M 2 and M 3 = 1.0 < fJ, 2 < 5.2 = M 4 . The minimizer 



and the maximizer of (50) occur at coordinates of the form (a, b), where a can take the values yi, 



Mi or M 2 , and b can take the values yi, M 3 or M 4 . Evaluating (50) at these coordinates yields 



the desired minimum and the maximum. At time t, let # m i n ,t and 6> max >t denote the minimizer and 



the maximizer, respectively. Minimization and maximization of (50) with respect to n (assuming 
that 0<a<7r<6<lfor some a, b obtained using Gibbs sampling) would have led to the 
independent distribution functions and Ff, but there exists a monotonicity structure in the the 
conditional distribution of n (see also Robert & Casella (2004)) which can be exploited to reduce 
the gaps between Ff 1 and Ff, by keeping n fixed in the lower and the upper bounds. Moreover, 
since optimization with respect to n is no longer needed, truncation of the parameter space of n is 
not required. Details follow. 

S- 1 1 .2 Monotonicity structure in the simulation of n 



It follows from ( 1 1 ) of MB that 7r ~ Beta(rii + 1, n — rii + 1). Then, at time t, n can be represented 
as 

ni+l n+2 

Ttt = Rn,t,k/ / ]RTT,t,k, (51) 

fc=l fc=l 

where {R n j,k, k — 1, . . . , n + 2} is a random sample from Exp(l), that is, the exponential distri- 
bution with mean 1. Thus, n t is increasing with respect to m, since the set of random numbers is 
fixed for all the Markov chains at time t. The form of ( [50] ) suggests that the distribution function 
is increasing with it and hence with n%. Let n u = #{i : z it = 1}, nf t = #{i : zf t = 1} and 
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n it = : z it = 1}' an( i note tna t Tii t < Tin < n\ t for any t. Define 



7T, 



V^li + i p 



u 



k = l - ft 7T,t,fc 



fc=l ft-nfiik 

With these, the lower and upper bounds of the distribution function of Zj at time t are given by 



(52) 
(53) 



u 



(54) 
(55) 



Combining the above developments, we propose the following algorithm for perfect simulation 



in 2-component mixture models, which is a slightly modified version of Algorithm 3.1 For our 
specific example, we must set n = 3 in the algorithm below. 
Algorithm S-ll.l CFTPfor two-component mixtures 

(i) For j = 1 . . ., until coalescence of Z, repeat steps (ii) and (iii) below. 

(ii) Define Sj = {-2 j + 1, . .., -2^ 1 } for j > 2, and let Si = {-1,0}. For each m £ 
Sj, generate random numbers Rz,m> Rn,m and -Re 2im ; once generated, treat them as fixed 
thereafter for all iterations. 

(iii) Fort = -2? + 1, . . . , -1, 0, 



(a) Calculate 7if and 7rf given by (52 ) and (53 ). For t = -2 j + 1, set n\ t = and 



n\ t = n. 



(b) For i — 1, 



n, 



1. For £ = 1,2, calculate if- (£ | n^Y) and iff | 7if given by (54) and 



(55). In this two-component example, these can be calculated exactly following 



the details presented in Sections S-ll.l and S-l 1.2 



2. Determine 4 = F?~{R z . t | rf, F) and = if"(i^ t | Trf , Y). 

(iv) If 4* = Zft, V z and for some t* < 0, then run the following Gibbs sampling steps from 
t = t* to t = 0: 
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(a) Let Z* = (2*, . . . , z*)' denote the coalesced value of Z at time t*. Given Z*, draw 
', ©2) from the full conditionals ( [IT] ), ([9]) and ( 10) in order, using the corresponding 



7T 



random numbers already generated; in fact, ir* will be computed using the representa- 



tion (51 ). Thus, (Z*, it*, ©2) is the coalesced value of the unknown quantities at t = t*. 



(b) Carry forward the above Gibbs sampling chain started at t — t* till t = 0, simulating 
sequentially from ([8]), ( fTTj ), Q and ( fit)] ). Again, 7r will be simulated using pi) . Then, 



the output of the Gibbs sampler obtained at t — 0, which we denote by (Z , 7r , ©2,o)> 
is a perfect sample from the true target posterior distribution. 

S-l 1.3 Results of perfect simulation in the two-component mixture example 



We first investigated the consequences of truncating the parameter space. Figure S-7 illustrates 
that in this example, the exact posterior densities of (n, /ii, /i 2 ) corresponding to bounded and full 
(unbounded) supports are almost indistinguishable from each other. 



We then implemented our perfect sampling algorithm by simulating Z from the bounds (54) 



and (55 ) and simulating the upper and lower chains for tt using the formulae (52) and (53 ). The 
histograms in Figure [S^8| corresponding to 100, 000 iid perfect samples match the exact posteriors 
almost perfectly, indicating that our algorithm has worked really well. 

S-l 1.4 Comparison with perfect sampling involving simulated annealing 

In the same two-component normal mixture example, we considered two versions of our perfect 
sampling algorithm: in the first version we considered exact optimization of the distribution func- 
tion of Zi, and in the second version we used simulated annealing for optimization. In both cases, 
we obtained 100,000 iid samples of (-7T, ^1,^2) at time t = 0, using the same set of random 
numbers. All 100, 000 samples of the second version turned out to be equal to the corresponding 
samples of the first version, suggesting great reliability of simulated annealing. 
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Figure S-7: Investigation of consequences of truncating the parameter space: the solid and the 
broken lines (almost indistinguishable) correspond to the exact posterior densities with respect to 
unbounded and bounded parameter spaces, respectively. 
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Figure S-8: The histograms correspond to perfect samples drawn using our algorithm. The density 
lines correspond to the exact posterior density. 
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